Learning from Multiple Sources of Inaccurate Data

Authors

  • Ganesh Baliga
  • Sanjay Jain
  • Arun Sharma
Abstract

Most theoretical models of inductive inference make the idealized assumption that the data available to a learner is from a single and accurate source. The subject of inaccuracies in data emanating from a single source has been addressed by several authors. The present paper argues in favor of a more realistic learning model in which data emanates from multiple sources, some or all of which may be inaccurate. Three kinds of inaccuracies are considered: spurious data (modeled as noisy texts), missing data (modeled as incomplete texts), and a mixture of spurious and missing data (modeled as imperfect texts). Motivated by the above argument, the present paper introduces and theoretically analyzes a number of inference criteria in which a learning machine is fed data from multiple sources, some of which may be infected with inaccuracies. The learning situation modeled is the identification in the limit of programs from graphs of computable functions. The main parameters of the investigation are: kind of inaccuracy, total number of data sources, number of faulty data sources which produce data within an acceptable bound, and the bound on the number of errors allowed in the final hypothesis learned by the machine. Sufficient conditions are determined under which, for the same kind of inaccuracy, for the same bound on the number of errors in the final hypothesis, and for the same bound on the number of inaccuracies, learning from multiple texts, some of which may be inaccurate, is equivalent to learning from a single inaccurate text. The general problem of determining when learning from multiple inaccurate texts is a restriction over learning from a single inaccurate text turns out to be combinatorially very complex. Significant partial results are provided for this problem. 
Several results are also provided about conditions under which the detrimental effects of multiple texts can be overcome either by allowing more errors in the final hypothesis or by reducing the number of inaccuracies in the texts. It is also shown that the usual hierarchies hold: allowing extra errors in the final program increases learning power, and allowing extra inaccuracies in the texts decreases it. Finally, it is demonstrated that, in the context of learning from multiple inaccurate texts, spurious data is better than missing data, which in turn is better than a mixture of spurious and missing data.
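The three kinds of inaccuracy named in the abstract can be illustrated with a small finite toy. This is only a sketch under stated assumptions: the abstract gives no formal definitions, so the code assumes one common formalization (a noisy text contains the whole graph plus at most a bounded number of spurious pairs, an incomplete text omits at most a bounded number of pairs and adds none, and an imperfect text differs from the graph by at most a bounded number of pairs in total). The paper itself concerns infinite texts for graphs of computable functions; the finite sets and the names `count_inaccuracies`, `classify`, and `sources_within_bound` are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]  # one (argument, value) point of a function's graph


def count_inaccuracies(graph: Set[Pair], content: Set[Pair]) -> Tuple[int, int]:
    """Return (spurious, missing) counts of a source relative to the true graph."""
    spurious = len(content - graph)  # pairs in the source that are not in the graph
    missing = len(graph - content)   # pairs of the graph the source never shows
    return spurious, missing


def classify(graph: Set[Pair], content: Set[Pair], bound: int) -> Dict[str, bool]:
    """Check a source against each assumed kind of inaccuracy for a given bound."""
    spurious, missing = count_inaccuracies(graph, content)
    return {
        "noisy": missing == 0 and spurious <= bound,
        "incomplete": spurious == 0 and missing <= bound,
        "imperfect": spurious + missing <= bound,
    }


def sources_within_bound(graph: Set[Pair], sources: List[Set[Pair]],
                         bound: int, kind: str) -> int:
    """Count how many of the sources stay within the bound for the given kind."""
    return sum(1 for content in sources if classify(graph, content, bound)[kind])


# Toy data (an assumption for illustration): f(x) = 2x restricted to 0..4.
graph = {(x, 2 * x) for x in range(5)}
accurate = set(graph)
noisy = graph | {(1, 5)}                 # one spurious pair
incomplete = graph - {(3, 6)}            # one missing pair
mixed = (graph - {(4, 8)}) | {(0, 9)}    # one missing and one spurious pair
sources = [accurate, noisy, incomplete, mixed]
```

With a bound of 1, two of the four toy sources qualify as noisy (the accurate one and the one with a single spurious pair), while three stay within the bound as imperfect. Counting sources this way mirrors, in a very simplified form, the abstract's parameter "number of faulty data sources which produce data within an acceptable bound".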


Similar resources

Learning Bayesian Networks from Inaccurate Data A Study on the Effect of Inaccurate Data on Parameter Estimation and Structure Learning of Bayesian Networks

The wealth of data collected by automated systems, or pen and paper processes, and available on the World Wide Web, is staggering. With these vast amounts of data come amazing possibilities for data mining and learning of data patterns through automated and semi-automated tools. However, these tools are inherently reliant on the quality of the data with which they are supplied. Without a thorou...


A Theory of Complementarity for Extracting Accurate Data from Inaccurate Sources through Integration

The purpose of this research is to develop a theory that helps produce accurate data integration output given multiple, overlapping, inaccurate sources. At present, the theoretical foundations of solutions that center on source selection and conflict resolution with a similar goal are limited. This paper introduces a new solution approach and theory that center on notions of complementarity, ba...


Designing collaborative learning model in online learning environments

Introduction: Most online learning environments are challenging for the design of collaborative learning activities to achieve high-level learning skills. Therefore, the purpose of this study was to design and validate a model for collaborative learning in online learning environments. Methods: The research method used in this study was a mixed method, including qualitative content analysis and...


Using Fictional Sources in the Classroom: Applications from Cognitive Psychology

Fictional materials are commonly used in the classroom to teach course content. Both laboratory experiments and classroom demonstrations illustrate the benefits of using fiction to help students learn accurate information about the world. However, fictional sources often contain factually inaccurate content, making them a potent vehicle for learning misinformation about the world. We briefly re...


Multi-view Positive and Unlabeled Learning

Learning with Positive and Unlabeled instances (PU learning) arises widely in information retrieval applications. To address the unavailability issue of negative instances, most existing PU learning approaches require to either identify a reliable set of negative instances from the unlabeled data or estimate probability densities as an intermediate step. However, inaccurate negative-instance id...



Journal title:
  • SIAM J. Comput.

Volume 26

Publication date 1992